Algorithm-based Fault Tolerance for Floating-point Operations in Massively Parallel Systems
Abstract
This paper considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This paper proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. It is shown that the partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.
Related resources
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one example of such codes. The algorithm-based fault tolerance techniques that detect errors rely on the c...
A new fixed degree regular network for parallel processing
We propose a family of regular Cayley network graphs of degree three based on permutation groups for design of massively parallel systems. These graphs are shown to be based on the shuffle exchange operations, to have logarithmic diameter in the number of vertices, and to be maximally fault tolerant. We investigate different algebraic properties of these networks (including fault tolerance) and...
Deadlock-Free: Fault Tolerant Wormhole Routing in Mesh based Massively Parallel Systems
In this paper we present a routing scheme that is extremely well suited for use in massively parallel systems. The routing algorithm is fault-tolerant, so network failures will not stop the system. For reasons of scalability, the routing information is extremely compact, even when the network is damaged. The wormhole routing technique guarantees very low routing latency. Moreover the routing...
A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems
A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...
Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs
Reversible logic is one of the new paradigms for power optimization that can be used instead of current circuits. Moreover, fault-tolerance capability in the form of error detection or error correction is a vital aspect of current processing systems. In this paper, as multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...